Dota Science

- A tutorial on using the OpenDota API and on wrangling, analyzing, and interpreting the data

Dota 2 is an online multiplayer game categorized as a MOBA (multiplayer online battle arena). The complexities run deep in this 8-year-old game, and with constant updates and changes there is no definitive "best" way to play. Playstyle is determined by many factors, such as desire to win, the current meta, and even personality. While some play for fun, most play to win. With a healthy competitive scene, some players struggle to improve their rank and often lay the blame on their teammates.

In this guide we will learn how to use the OpenDota API to gather statistics about matches and players. We will also do a lot of data wrangling to turn a somewhat cluttered API response into clean, useful data.

What actually matters in Dota? Are you focusing on the wrong things? Rather than compare yourself to others, this guide focuses on finding which stats actually have an effect on your wins and losses.

A big thank you to our volunteer who allowed me to analyze their recent matches and use their account as a demo.

Using the OpenDota API

https://docs.opendota.com/#tag/matches%2Fpaths%2F~1matches~1%7Bmatch_id%7D%2Fget

OpenDota is a website that provides its own Dota API. While Valve has one as well, we will focus on OpenDota for this guide.

Let's begin by installing and importing the necessary packages.

Here we make our first request to ensure the API is working.
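A first request might look like the sketch below. It uses only the standard library's urllib so that it runs without extra installs (the requests package would work just as well). The helper names player_matches_url and fetch_json and the placeholder account ID are my own assumptions for illustration; the endpoint path follows the OpenDota docs.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.opendota.com/api"


def player_matches_url(account_id, **params):
    """Build the URL for a player's match list, with optional query parameters."""
    url = f"{API_ROOT}/players/{account_id}/matches"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body; raises on HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example (performs a real network call, so it is left commented out):
# ACCOUNT_ID = 123456789  # hypothetical placeholder, not the volunteer's account
# matches = fetch_json(player_matches_url(ACCOUNT_ID))
# print(len(matches), "matches returned")
```

If the call returns a non-empty list of match dictionaries, the API is up and we can move on.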

Then we save the data and import it into a pandas DataFrame.
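As a rough sketch of this step, the snippet below caches a response to disk and loads it back into a DataFrame. The two records are hand-written stand-ins shaped like entries from the player matches endpoint, and the cache file name is hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hand-written stand-ins for two entries of the API response (not real matches).
matches = [
    {"match_id": 1, "player_slot": 0, "radiant_win": True, "hero_id": 5,
     "kills": 3, "deaths": 7, "assists": 20, "party_size": 1},
    {"match_id": 2, "player_slot": 130, "radiant_win": True, "hero_id": 31,
     "kills": 1, "deaths": 9, "assists": 12, "party_size": None},
]

# Save the raw response so later runs do not need to hit the API again.
cache_path = Path(tempfile.gettempdir()) / "matches.json"  # hypothetical file name
cache_path.write_text(json.dumps(matches))

# Each dictionary becomes a row; missing values (JSON null) show up as NaN.
df = pd.DataFrame(json.loads(cache_path.read_text()))
print(df.head())
```

Caching the raw JSON first is a design choice: if the wrangling goes wrong later, we can reload the file instead of spending more API calls.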

Immediately we see some issues with our dataset. Many of the older matches have far less data and, being from as long as 8 years ago, would not provide much insight anyway. We will begin by simply calling dropna to remove every row that contains a NaN value.
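A tiny illustration of what dropna does here, using made-up rows where the oldest match is missing a field:

```python
import pandas as pd

# Made-up rows: the oldest match lacks party_size.
df = pd.DataFrame({
    "match_id": [1, 2, 3],
    "party_size": [None, 2, 1],
    "kills": [4, 2, 6],
})

# Drop every row that has at least one NaN, then renumber the index.
clean = df.dropna().reset_index(drop=True)
print(clean)
```

Note that dropna with no arguments is aggressive: one missing column is enough to discard the whole match. Passing subset=[...] would limit the check to the columns you actually need.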

For some reason, the only team indication is an 8-bit number (player_slot), where 0-127 means Radiant and 128-255 means Dire. Converting it every time would be annoying, so let's make it easier for ourselves.
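One way to decode the slot is sketched below. The function names are my own; the bit layout (top bit for the team, low three bits for the position within the team) follows the public API documentation.

```python
def is_radiant(player_slot):
    """Slots 0-127 are Radiant, 128-255 are Dire (the top bit marks the team)."""
    return player_slot < 128


def team_position(player_slot):
    """Position within the team, 0-4, stored in the low three bits."""
    return player_slot & 0b111


for slot in (0, 4, 128, 130):
    side = "Radiant" if is_radiant(slot) else "Dire"
    print(slot, side, team_position(slot))
```

On a DataFrame the same idea is a one-liner, for example df["is_radiant"] = df["player_slot"] < 128.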

Additionally, let's bring in a win/loss value.
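The match list only tells us whether Radiant won, so a win for our player means "was on Radiant and Radiant won" or "was on Dire and Radiant lost". A minimal sketch with made-up rows (the column names is_radiant and win are my own choices):

```python
import pandas as pd

# Made-up rows covering all four team/outcome combinations.
df = pd.DataFrame({
    "player_slot": [0, 130, 2, 129],
    "radiant_win": [True, True, False, False],
})

df["is_radiant"] = df["player_slot"] < 128
# The player wins exactly when their side matches the winning side.
df["win"] = (df["is_radiant"] == df["radiant_win"]).astype(int)
print(df)
```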

Party Size

Let's start off with something super simple, just to get a feel for how our data looks and behaves. It would not be fun to set up a complex model only to realize something is wrong.

Let's analyze winrate purely by party size.
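A sketch of the grouping, using a handful of made-up matches and assuming a 0/1 win column like the one described above:

```python
import pandas as pd

# Made-up matches; win is 1 for a win and 0 for a loss.
df = pd.DataFrame({
    "party_size": [1, 1, 1, 2, 2, 3],
    "win":        [1, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the winrate; count shows how many games back it up.
summary = df.groupby("party_size")["win"].agg(games="count", winrate="mean")
print(summary)
```

Keeping the game count next to the winrate makes it easy to spot party sizes with too few matches to trust.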

5-stacks are clearly an outlier, and apparently a high one. Let's remove party sizes 4 and 5 and take a look at more typical matches.

While the difference is small, it is noticeable: solo queuing has almost a 3% advantage over trio queuing.

Better Data

We're really only given so much data from this request. Fortunately, another API call takes a match ID and returns a large amount of data, including an array of all players in the game. The work we've done so far is still helpful, because the new JSON stores each player in a sub-array organized by player position, which we already have.

Unfortunately, the API has been giving me lots of issues lately, being abnormally slow and returning quite a few errors. For this reason we only take a random sample of 100 matches. Feel free to use as many data points as you want in your own project; the more the merrier.

Also, the free version of the API only allows 60 calls a minute, so I went ahead and purchased an API key allowing 1500 calls a minute for easier testing and debugging of this tutorial. The key you see will no longer work, as I reset mine after posting this tutorial.
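The sampling and rate limiting described above could be sketched as follows. All function names here are hypothetical, and fetch is any callable that takes a URL and returns parsed JSON (so the sketch can be tried without touching the network). The api_key query parameter is how OpenDota accepts a key.

```python
import random
import time
import urllib.parse

API_ROOT = "https://api.opendota.com/api"


def match_url(match_id, api_key=None):
    """URL for the detailed match endpoint, optionally carrying an API key."""
    url = f"{API_ROOT}/matches/{match_id}"
    if api_key:
        url += "?" + urllib.parse.urlencode({"api_key": api_key})
    return url


def sample_match_ids(match_ids, k=100, seed=0):
    """Pick up to k distinct match IDs; a fixed seed keeps the sample repeatable."""
    ids = list(match_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def fetch_matches(match_ids, fetch, calls_per_minute=60, api_key=None,
                  sleep=time.sleep):
    """Fetch each match, pausing between calls to stay under the rate limit."""
    delay = 60.0 / calls_per_minute
    results = []
    for match_id in match_ids:
        try:
            results.append(fetch(match_url(match_id, api_key)))
        except Exception as exc:  # broad on purpose: skip failed matches
            print(f"match {match_id} failed: {exc}")
        sleep(delay)
    return results


# Dry run with a fake fetcher and no real sleeping:
ids = sample_match_ids(range(1000), k=100)
dry_run = fetch_matches(ids[:3], fetch=lambda url: {"url": url},
                        sleep=lambda seconds: None)
print(dry_run)
```

With a paid key you would raise calls_per_minute; with the free tier, 60 calls a minute means 100 matches take a little under two minutes.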

We are now almost back where we were previously. Let's extract the data for our player in question.

Since we're looking at just one player, we need to extract all of the information from their respective player position.
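One possible way to pull our player out of a match is to look for the player_slot we already know from the match list. The helper name extract_player is hypothetical, and the match below is a trimmed, made-up example (a real response has ten players and many more fields).

```python
def extract_player(match, player_slot):
    """Return the entry in match["players"] for the given slot, or None."""
    for player in match.get("players", []):
        if player.get("player_slot") == player_slot:
            # Carry the match ID along so rows can be joined back later.
            return {"match_id": match["match_id"], **player}
    return None


# Trimmed, made-up match for illustration.
match = {
    "match_id": 10,
    "players": [
        {"player_slot": 0, "kills": 5},
        {"player_slot": 130, "kills": 2},
    ],
}
print(extract_player(match, 130))
```

Collecting these dictionaries into a list and passing it to pd.DataFrame gives one row per match for our player.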

Good work! We are now almost done with the biggest data-wrangling portion. Let's replace some NaN values with 0s and bring our previously computed win/loss back in.
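A sketch of this cleanup on made-up data. The column names obs_placed and camps_stacked are fields the match endpoint reports per player, but which columns you keep is up to you; the frame names details and outcomes are hypothetical.

```python
import pandas as pd

# Made-up per-player details with gaps, as returned for some matches.
details = pd.DataFrame({
    "match_id": [1, 2, 3],
    "obs_placed": [4, None, 7],
    "camps_stacked": [None, 2, 1],
})

# The win/loss column computed earlier from the match list.
outcomes = pd.DataFrame({"match_id": [1, 2, 3, 4], "win": [1, 0, 1, 0]})

# Treat missing counts as zero, then attach the outcome by match ID.
wrangled = details.fillna(0).merge(
    outcomes[["match_id", "win"]], on="match_id", how="left"
)
print(wrangled)
```

Filling with 0 assumes a missing count really means "none happened"; for fields where that is not true, dropping the row may be the safer choice.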

Vision

Let's graph something more complex with our newly wrangled data.

Vision is a crucial part of any Dota match. Playing in the dark can be painful, but do the numbers tell the same story?
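A violin plot comparing wins and losses could be drawn along these lines. This is a sketch with made-up numbers, and I am assuming purchase_ward_observer as the vision stat; swap in whichever vision column you prefer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: observer wards bought in four losses and four wins.
df = pd.DataFrame({
    "win": [0, 0, 0, 0, 1, 1, 1, 1],
    "purchase_ward_observer": [3, 5, 8, 6, 4, 7, 5, 9],
})

losses = df.loc[df["win"] == 0, "purchase_ward_observer"].to_numpy()
wins = df.loc[df["win"] == 1, "purchase_ward_observer"].to_numpy()

fig, ax = plt.subplots()
# One violin per outcome; the median line makes the two easy to compare.
parts = ax.violinplot([losses, wins], positions=[0, 1], showmedians=True)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Loss", "Win"])
ax.set_ylabel("Observer wards purchased")
ax.set_title("Vision by match outcome")
```

If the two violins have similar shapes and medians, the stat on its own does little to separate wins from losses.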

From this violin plot alone, we can hardly say vision matters. An interesting observation. Let's try to back it up.

Building Models

Much like Valve's existing Dota Plus win predictor, we can create our own ML model that uses different parameters to predict an outcome.

Linear Regression

Let's start by predicting win/loss from vision, as we did previously. We will need a dummy variable for win/loss.

For the following linear regression models we will be using statsmodels OLS. Let's see if a linear model confirms our hypothesis. https://www.statsmodels.org/stable/index.html
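With the statsmodels formula API, a model with two ward variables and their interaction can be sketched like this. The data is randomly generated (so nothing should come out significant), and the column names purchase_ward_observer and purchase_ward_sentry are my assumption about which ward stats to use.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Random stand-in data; win is the 0/1 dummy variable.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "purchase_ward_observer": rng.integers(0, 15, n),
    "purchase_ward_sentry": rng.integers(0, 25, n),
    "win": rng.integers(0, 2, n),
})

# "a * b" expands to both main effects plus the a:b interaction term.
model = smf.ols(
    "win ~ purchase_ward_observer * purchase_ward_sentry", data=df
).fit()
print(model.summary())
```

The P>|t| column of the summary is where to look: values above 0.05 mean that term is not statistically significant at the usual threshold.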

Much like in our previous graph, we can see that, surprisingly enough, ward purchases don't have a massive impact on the result of the game. So much so that neither of the variables nor their interaction is statistically significant.

Let's explore some more variables. Feel free to try out others that you are curious about, too.

This is very intuitive as well. We see assists and deaths mattering a lot, but kills not as much, which makes sense for our support player.

In a game like Dota, arguably one of the most important factors is the hero you play. While we could just look at hero winrate, that is boring, so let's make a model based on hero_id.
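Because hero_id is a label rather than a quantity, it has to enter the model as a categorical variable. A sketch using C(...) in the formula, on made-up data with three arbitrary hero IDs:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: 30 games on each of three arbitrary hero IDs, random outcomes.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "hero_id": np.repeat([5, 26, 31], 30),
    "win": rng.integers(0, 2, 90),
})

# C(hero_id) creates one dummy column per hero, minus a baseline hero.
model = smf.ols("win ~ C(hero_id)", data=df).fit()
print(model.params)
print(model.pvalues)
```

In this setup the intercept is the winrate on the baseline hero, and each coefficient is the difference between that hero's winrate and the baseline. When every hero sits near 50%, those differences are small and the p-values large.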

Without a staggering winrate on any hero, a linear regression model will never show that one hero is more significant than the other. Since most hero winrates average around 50% on this account, the linear regression model sees them as very insignificant.

Let's try a super broad attack to see if we can discover any unexpected information.

Additionally, note that descriptions for these tags can be found here http://sharonkuo.me/dota2/matchdetails.html

GPM, hero damage, and healing all make sense. They're important, and you do more of them when you're winning. Something worth noting is the significance of stacking camps: it is significant, but it has a negative weight, implying that more stacking is associated with worse outcomes.

Another interesting result is that the amount of pinging is not relevant. Some may be surprised by this and some may not.

Decision Trees

A more intuitive way to create a predictive model would be with a decision tree and a random forest. Since we had to use a dummy variable for win/loss in the linear models, it may be slightly skewing all of our results.

Let's grab some features that we believe are good indicators. We will be using scikit-learn to build these models. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

The method is based on the guide here: https://www.datacamp.com/community/tutorials/decision-tree-classification-python

We will be using a 70/30 holdout validation to assess accuracy.
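A 70/30 holdout with a default decision tree can be sketched as below. The data is synthetic and the label is a simple threshold on gold_per_min, so the accuracy here will be far higher than on real matches; the three feature names are assumptions standing in for whatever stats you choose.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic matches; the label is deliberately easy to learn.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "gold_per_min": rng.integers(200, 700, n),
    "deaths": rng.integers(0, 15, n),
    "assists": rng.integers(0, 30, n),
})
df["win"] = (df["gold_per_min"] > 450).astype(int)

features = ["gold_per_min", "deaths", "assists"]
# Hold out 30% of the matches for testing; train on the other 70%.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["win"], test_size=0.3, random_state=1
)

tree = DecisionTreeClassifier(random_state=1)  # default hyperparameters
tree.fit(X_train, y_train)
accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"holdout accuracy: {accuracy:.2f}")
```

Fixing random_state makes the split and the tree repeatable, which helps when comparing models on the same holdout set.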

The decision tree gives us a moderate accuracy under holdout validation. Note that we used all default parameters and hyperparameters. Next, let's compare it to a random forest.

75% accuracy is decent considering how much these values vary from game to game.

Random Forests

Next up we will try a random forest. Hypothetically, this should be more accurate. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

We will use the same 70/30 holdout validation here, hoping to get an even higher accuracy.
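The same holdout idea applied to a random forest might look like this. Again the data is synthetic (a noisy mix of gold and deaths decides the outcome), the helper holdout_accuracy is hypothetical, and the feature names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def holdout_accuracy(model, X, y, test_size=0.3, seed=1):
    """Fit on 70% of the rows and report accuracy on the remaining 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic matches: outcome depends on gold and deaths, plus some noise.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "gold_per_min": rng.integers(200, 700, n),
    "deaths": rng.integers(0, 15, n),
    "assists": rng.integers(0, 30, n),
})
score = X["gold_per_min"] - 20 * X["deaths"] + rng.normal(0, 40, n)
y = (score > score.median()).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest_accuracy = holdout_accuracy(forest, X, y)
# Importances sum to 1 and hint at which stats the forest leans on.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(f"holdout accuracy: {forest_accuracy:.2f}")
print(importances.sort_values(ascending=False))
```

Passing a DecisionTreeClassifier to the same helper gives a like-for-like comparison on the identical split.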

As expected, the random forest is better. A 10% increase is very significant, and it makes sense considering what we mentioned earlier: each Dota game is very different and hard to classify purely from 6-7 stats.

Analyzing and Drawing Conclusions from our Data

Now that we've covered how to collect, wrangle, and analyze data, we need to draw some conclusions about what we've seen. In this tutorial, we covered a wide variety of stats and either confirmed or rejected their relevance to the outcome of the game. In your case, you may be curious about other factors and interpret the results differently.

For example, stacking camps ended up being significant in our linear model. This isn't too unexpected, but the fact that it had a negative weight is. This suggests that, for some reason, the player in question is hurting themselves by stacking camps. Why is that? Could it be because the enemy team is able to take the stacks before you are? Are your stacks getting stolen? Is it because you're leaving to stack at the wrong time and it's hurting your team more than helping?

All these questions are productive and will help make you a better player. They will allow you to discover and improve on traits you may not have even questioned.

Looking forward, there are endless possibilities for using the OpenDota API to collect and analyze data. Another interesting factor might be chat logs. I avoided them in this tutorial as they can be pretty toxic, but how does chatting affect your gameplay? Are you losing more when you chat more? Are there certain words you say when you're losing?

Overall, the possibilities are endless and it's up to you how you want to use this information. Remember, it's important to follow a semi-rigid process of collect, wrangle, analyze, reflect.

Thank you for reading my tutorial. I hope you found it interesting and informative. If you have any questions or would simply like to learn more, feel free to contact me at jack.maiorino@gmail.com